Methods Summary
===============

Preface
-------

This document assumes that the inputs are the PLCO chip freeze, the imputed data freeze, and IMS phenotype data (v10). A description of the methods leading to these files is beyond the scope of this document and can be found elsewhere.

Preprocessing
-------------

Relatedness Estimation
~~~~~~~~~~~~~~~~~~~~~~

Genotype data from the five PLCO platforms were updated to match the variant IDs present in the `graf`_ reference dataset ``G1000GpGeno``. Each chip dataset in turn was converted to `graf`_ fpg format and used to estimate within-platform subject relatedness with ``graf -geno``.

.. _graf: https://github.com/ncbi/graf

Ancestry Estimation
~~~~~~~~~~~~~~~~~~~

Genotype data from relatedness estimation were used to estimate subject ancestry with ``graf -pop`` and `graf`_ ``PlotPopulations.pl``. As ancestry estimation was conducted separately for each platform, several subjects with borderline ancestry had discordant ancestry calls between platforms. In these instances, the discordance was resolved by replacing the call from the Oncoarray (which happened to be involved in every such discordance) with the call determined from the denser array.

Chip Cleaning by Ancestry Group
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Genotype data from each of the five PLCO platforms were split by ancestry using the computed `graf`_ ancestry calls. Variants were filtered within each platform/ancestry subset using `plink 1.9`_, removing variants with minor allele frequency less than 1%, variant-level missingness greater than 2%, or Hardy-Weinberg Equilibrium exact p-value less than 0.001. The remaining variants were then weakly pruned for linkage disequilibrium using successive passes of the `plink 1.9`_ commands ``--indep 50 5 2`` and ``--indep-pairwise 50 5 0.2``. Heterozygosity outliers were computed and removed at ``|F| > 0.2``. Principal components within each platform/ancestry subset were estimated using `smartpca`_ fastmode.

.. _`plink 1.9`: https://www.cog-genomics.org/plink/
.. _`smartpca`: http://data.broadinstitute.org/alkesgroup/EIGENSOFT/

LD Score Regression Reference Files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

1000 Genomes reference files in GRCh38 tagged ``20181203`` were downloaded from the `1000 Genomes ftp site`_. Genotypes were split out by 1000 Genomes supercontinent (AFR: African; AMR: admixed American; EAS: East Asian; EUR: European; SAS: South Asian), converted to plink format with `plink 1.9`_, and updated with genetic map estimates from the hg38 recombination map provided with `BOLT-LMM`_. LD Score estimates for these reference groups were computed by `ldsc`_ with the following options: ``ldsc.py --maf 0.005 --l2 --ld-wind-cm 1``. For datasets for which these data are available, stock reference data were downloaded from ``https://data.broadinstitute.org/alkesgroup/LDSCORE/``.

.. _`1000 Genomes ftp site`: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/release
.. _`BOLT-LMM`: https://alkesgroup.broadinstitute.org/BOLT-LMM
.. _`ldsc`: https://github.com/bulik/ldsc

BGEN v1.2 Imputed Data Conversion
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

All primary analysis tools used during initial "Atlas" testing accepted `bgen v1.2`_ format imputed data, so all imputed genotypes were converted to this format with `plink 2`_ in two passes. The data were first converted to ``pgen`` format with ``plink2 --vcf {file} dosage=HDS --id-delim _ --make-pgen erase-phase``; then the resulting pgen files were reformatted to `bgen v1.2`_ with ``plink2 --recode bgen-1.2 bits=8``.

.. _`bgen v1.2`: https://www.well.ox.ac.uk/~gav/bgen_format/
.. _`plink 2`: https://www.cog-genomics.org/plink/2.0/

Primary Analysis
----------------

Phenotype Modeling
~~~~~~~~~~~~~~~~~~

Phenotype and covariate data from IMS v10, along with indicator variables reporting genotyping platform batch and ``Other Asian`` raw ancestry calls from `graf`_, were processed and formatted into model matrix files.
Continuous traits were inverse normal transformed within ancestry group, stratified by sex, with random resolution of ties. Categorical traits were processed into individual binary contrasts, each comparing one non-reference group against a single reference group (category 0, with the largest number of subjects); non-reference groups with fewer than 10 subjects were combined into a single meta-group based on the PLCO analysis plan document guidelines. All categorical covariates were similarly processed and turned into binary covariates to maintain compatibility with analysis tools without direct support for qualitative covariates.

Primary Analysis with BOLT-LMM
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For each platform/ancestry combination with at least 3000 subjects, chip subsets corresponding to these data were lifted from GRCh37 to GRCh38 with `liftOver`_. Linear mixed model association with `BOLT-LMM`_ was run with the following parameters:

* ``--bgenFile {filename}``
* ``--sampleFile {filename}``
* ``--lmm``
* ``--LDscoresFile {filename}``
* ``--statsFile {filename}``
* ``--statsFileBgenSnps {filename}``
* ``--phenoFile {filename}``
* ``--phenoCol {column name}``
* ``--covarFile {filename}``
* ``--qCovarCol {covariate list}``
* ``--LDscoresMatchBp``
* ``--geneticMapFile {filename}``

.. _`liftOver`: http://hgdownload.soe.ucsc.edu/admin/exe/

Primary Analysis with SAIGE
~~~~~~~~~~~~~~~~~~~~~~~~~~~

For each platform/ancestry combination with at least 1000 subjects and 30 cases for a given model matrix, chip subsets corresponding to these data were lifted from GRCh37 to GRCh38 with `liftOver`_. Logistic mixed model association with `SAIGE`_ was run in two passes with the following functions and settings.
For round one:

* ``SAIGE::fitNULLGLMM``
* ``plinkFile {file prefix}``
* ``phenoFile {filename}``
* ``phenoCol {column name}``
* ``sampleIDColinphenoFile {column name}``
* ``covarColList {covariate list}``
* ``nThreads 4``
* ``traitType "binary"``
* ``outputPrefix {file prefix}``
* ``IsSparseKin TRUE``
* ``relatednessCutoff 0.05``

For round two:

* ``SAIGE::SPAGMMATtest``
* ``bgenFile {filename}``
* ``bgenFileIndex {filename}``
* ``vcfField DS``
* ``chrom {chromosome}``
* ``minMAF 0.01``
* ``GMMATmodelFile {filename}``
* ``sampleFile {filename}``
* ``minMAC 1``
* ``varianceRatioFile {filename}``
* ``SAIGEOutputFile {filename}``
* ``IsOutputAFinCaseCtrl TRUE``
* ``sparseSigmaFile {filename}``

.. _`SAIGE`: https://github.com/weizhouUMICH/SAIGE

Primary Analysis Postprocessing
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

After each analysis, the native result format was converted to the file format agreed upon with CBIIT. Allele frequencies from raw results were updated to approximate TOPMed reference frequencies estimated from test imputations of 1000 Genomes subjects from each supercontinent versus the TOPMed 5b reference panel.

Meta-Analysis
-------------

For each continuous and binary phenotype, platform subsets of the same `graf`_ ancestry group were meta-analyzed together with `metal`_ with heterogeneity analysis.

For categorical phenotypes, each ancestry group was meta-analyzed across platforms as listed above. Then, the (N-1) binary comparisons against the same reference group were combined by applying a Bonferroni correction for (N-1) tests to the minimum p-value per variant. This p-value is biased by minimum p-value selection, and should be replaced in future iterations of this analysis.

.. _`metal`: https://genome.sph.umich.edu/wiki/METAL

LD Score Regression
-------------------

Results files from each analysis were processed to contain signed summary statistics.
These files were then processed with the `ldsc`_ helper script ``munge_sumstats.py`` using the following parameters:

* ``--signed-sumstats STAT,0``
* ``--out {filename}``
* ``--a1-inc``
* ``--frq Freq_Tested_Allele_in_TOPMed``
* ``--N-col N``
* ``--a1 Tested_Allele``
* ``--a2 Other_Allele``
* ``--snp SNP``
* ``--sumstats {filename}``
* ``--p P``

Finally, the resulting processed files were used to estimate LD score regression intercepts with `ldsc`_ script ``ldsc.py`` against reference LD scores from the matched supercontinent.
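
The two LD score regression steps above can be sketched as a pair of command builders. This is a minimal illustration rather than the pipeline's actual code: the ``munge_sumstats.py`` arguments are the ones listed above, but the source does not state which ``ldsc.py`` options were used for the final step, so ``--h2``, ``--ref-ld-chr``, and ``--w-ld-chr`` are an assumption based on standard ldsc usage. The function names and all file paths are hypothetical.

```python
import shlex


def munge_command(sumstats, out_prefix):
    """Build the munge_sumstats.py call using the parameters listed above."""
    return [
        "munge_sumstats.py",
        "--sumstats", sumstats,
        "--out", out_prefix,
        "--signed-sumstats", "STAT,0",
        "--a1-inc",
        "--frq", "Freq_Tested_Allele_in_TOPMed",
        "--N-col", "N",
        "--a1", "Tested_Allele",
        "--a2", "Other_Allele",
        "--snp", "SNP",
        "--p", "P",
    ]


def ldsc_h2_command(munged, ld_prefix, out_prefix):
    """Build an ldsc.py call that reports the regression intercept.

    Assumption: the options below are standard ldsc usage, not options
    documented for this pipeline. ld_prefix points at per-chromosome
    reference LD scores for the matched supercontinent.
    """
    return [
        "ldsc.py",
        "--h2", munged,
        "--ref-ld-chr", ld_prefix,
        "--w-ld-chr", ld_prefix,
        "--out", out_prefix,
    ]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    munge = munge_command("results.tsv.gz", "results.munged")
    # munge_sumstats.py writes its output to {out}.sumstats.gz.
    h2 = ldsc_h2_command("results.munged.sumstats.gz", "ldscores/EUR/", "results.ldsc")
    print(shlex.join(munge))
    print(shlex.join(h2))
```

Each list can be passed directly to ``subprocess.run(cmd, check=True)`` once the ldsc scripts are on the ``PATH``; building argument lists rather than shell strings avoids quoting problems with file names.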